This workshop will provide an introduction to the programming language commonly referred to as R!
R is a popular programming language that many researchers use for organizing data, visualizing data, and carrying out statistical analyses.
By the end of this workshop series, our hope is that you will feel comfortable enough to work independently in R!
Before the workshop, we’ll need to download R and RStudio. Throughout the workshop, we’ll be working in RStudio, which will allow us to write code in R. So let’s make sure we have both R and RStudio installed before we begin!
Download an R CRAN Mirror, which basically just hosts the R programming language that we will be using in RStudio. https://cran.r-project.org/
Download RStudio, which is the main software that we will be using to work with R. https://posit.co/download/rstudio-desktop/
Download the intro-to-coding-2025 folder from the TU-COG Github page (https://github.com/TU-Coding-Outreach-Group/intro-to-coding-2025) by pressing the green Code button and downloading the ZIP folder. This is the folder containing all the files that we will be working with for the purposes of this workshop.
Open up a new R Markdown document by clicking File > New File > R Markdown. First time R users will be asked to download packages once they open up an R Markdown file. Click “Yes” to downloading those packages!
First things first: if you haven’t already done so, open R Studio. Next, you’re going to want to open a new R Markdown document by clicking File > New File > R Markdown…
This should produce a dialogue box where you can enter the title of the script and your name before selecting OK.
Next, let’s clear out all of the default text that appears in a new R Markdown document, which I have highlighted below:
In a typical coding script, every line must contain code that the language could interpret. If you want to include notes, you have to include a hash mark (#) before any code in order for the program to “ignore this line”, which can get a bit annoying. An R Markdown script is much more user friendly.
With R Markdown, any code that you would like R to interpret belongs in the coding chunk as illustrated below.
If we want to leave notes, we don’t have to “comment it out”. We can just write long-winded narration that can help others understand why we coded what we coded and what that code does.
That’s because a typical script will interpret any text as a command, unless the text is otherwise marked by a hashtag (#). An R markdown script only interprets things as code when we tell it to, and we tell it what is code by creating a chunk. Chunks are marked by three backticks (```) followed by a {r} and, on another line, three more backticks.
The output of any given chunk will appear just below that chunk,
rather than in the R Studio Console Window. By output, we just
mean the product, sum, or status of whatever calculation or item you are
asking R to compute and show you.
R Markdown grants us greater control over what we see and when we see it. To demonstrate, let’s start by creating a new chunk in our markdown document and entering what we see in the image above:
2 + 2
## [1] 4
With a typical script, if we want to know the output of a line we ran
awhile ago, we either have to rerun it or scroll through the console to
find it. With Markdown we can minimize entire chunks and their output by
using the minimization button [] on the left side of the window.
If we want to hide output, we can use the expand/collapse button
[] on the right
side of the output window.
We can choose exactly what we want to run using the the
“Run” command [] in the upper right corner of the chunk.
Also of note, the down-facing arrow (second icon in the upper right
corner of the code block) will tell R “Run all of the blocks of command
that I have before this block” []. It can be helpful if you make a
mistake and don’t want to manually rerun all of the previous blocks one
by one to get back to where you were. It also makes your code very easy
for other people to run. They can quite literally do it with the click
of a button!
If we click the cog icon in the same tray, we can access the output
options and manipulate where output appears and what it looks like, but
that’s beyond the scope of this review [].
I know this is a lot of information so don’t worry about remembering all of it right now – I’ll talk through some of the buttons we’ll be using later on when we start working through our R script!
Packages in R are synonymous with libraries in other languages. They are more or less convenient short-cuts or functions someone else already programmed to save us some work. Somebody else already figured out a very quick way to compute a function so now we don’t have to! We just use their tools to do it.
Every new package is centralized in R’s repository, so even though thousands of people are working on these things independently, you don’t need to leave R to find them. Before they can be used, they must be installed, and you can do that in a pretty simple way:
Importantly, once you install a package, you don’t need to install it again. R can remember which packages have been previously installed.
install.packages("PACKAGENAME")
If you need to update a package, you can just re-run the above code. If you’re using R Studio, you can also see a list of your packages and their associated descriptions in the ‘Packages’ Tab of your Viewer Window.
Now that we have installed a package, that doesn’t mean we can use it
yet. We need to tell R “We want access to the functions that this
package has during this session” by calling it with the
library() command.
library(PACKAGENAME)
Notice that we drop the quotation marks now. We just specify the (case-sensitive) package name and it lets R know we are planning on using that this session.
You might be wondering why we need to take this extra step. Sometimes different packages use the same commands, so having more than one of those active at the same time could confuse R (When this does happen, R will usually tell you). Sometimes packages take up a lot of disk space, so having ALL of your packages initialized at once might leave your computer running extremely slow. It’s the same for most languages.
If we ever want to explore the functions contained within a package in conjunction with examples, we can either go to the R documentation website or type ‘??PackageName’ into the Console, which will then populate the Help Tab of the Viewer Window with information on the package.
Lastly, although you only need to install a package once, you have to
load in any packages you want to use in a given R session. In other
words, after installing a package, you can skip the
install.packages() step moving forward, but you’ll still
need to run the library() step each time you want to use
the package in R.
Let’s try installing and loading in a few package for practice. Let’s install and load the following packages in R: tidyverse, report, lme4, and ggplot2
For today’s workshop, we’ll be working with the palmerpenguins dataset. Please make sure you have the 2 CSV files (palmer-penguins.csv and palmer-penguins_raw.csv) already downloaded!
The palmerpenguins data contains two datasets:
The curated data, which contains size measurements for 344 penguins from three penguin species observed on three islands in the Palmer Archipelago, Antarctica.
The raw data, which contains additional information about the penguins.
References palmerpenguins data originally published in:
Gorman KB, Williams TD, Fraser WR (2014). Ecological sexual dimorphism and environmental variability within a community of Antarctic penguins (genus Pygoscelis). PLoS ONE 9(3):e90081. https://doi.org/10.1371/journal.pone.0090081
Before we start talking about “working directories”, let’s quickly unpack what a “file path” is on a computer.
“File paths” is just a fancy way to refer to the folders and subfolders that exist on someone’s computer. Understanding how to navigate file paths is really important for being able to access specific files that you want to work with!
Here’s an example of a file path: “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025”
And here’s a visualization of what that file path looks like on a computer:
Each part of the file path is a separate folder that we are traversing through.
File paths are especially important for setting your working directory and finding the data that you want to work with, just like we’ll need to do for this workshop.
In order to start working with the palmerpenguins datasets, we need to revisit the concept of “file paths”. We’ll use our understanding of file paths to set our “working directory”. At this point, some people may be wondering what a “working directory” is.
A working directory refers to the specific file path on your computer where R (or any programming language) will look for files by default. Like any other language or program, R needs to be told where the file that we’d like to work with is located on our computer. It doesn’t just know automatically where every file is on a person’s computer.
Below we’ll use the getwd() command to find out where
where my current working directory is.
getwd() #get your current working directory
## [1] "/Users/tuh20985/Documents/GitHub/intro-to-coding-2025"
As you can see, my working directory is currently set at: “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025”
The output probably doesn’t make much sense to anyone because this working directory file path is specific to my computer! Each of our working directory file paths will be specific to the computer we are using.
In order to work with the palmerpenguins datasets, we need to tell R where the files are located. We can create a new object that stores a specific file path to make this process easier. File paths will differ based on whether you are using a Windows versus a Mac.
If you’re using a Windows computer, it’s likely your file path will exist within your “C:/ Drive”.
If you’re on a Mac, it’s likely your file path will start with a forward slash “/”.
# For Windows
Path <- "C:/"
# For Mac
Path <- "/"
On my computer, here’s where the palmerpenguins data exists:
In order to work with the palmer-penguins.csv file, I’ll need to set my working directory to this file path!
Let’s store this file path in an object called “Path”.
# For Windows
Path <- "C:/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/"
# For Mac
Path <- "/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/"
This format of assigning a value to an object, like we did with Path and “/Users/tuh20985/Documents/GitHub/intro-to-coding-2025/data/” is really important and we’ll keep coming back to it throughout this tutorial.
We’ll next use this Path object to set our working
directory using the setwd() command. The
setwd() command tells R where to look for files on our
computer.
setwd(Path) #use the setwd() function to assign the "Path" object that we created earlier as the working directory
Amazing! Now that our working directory is set to the correct file path, we can start load in the palmer-penguins data!
Before we load in the palmer-penguins data, I want to highlight a little terminology. The data that R works with is always contained within what we call a ‘dataframe’. A dataframe represents the same thing that a spreadsheet represents in Excel. It contains many cells that are situated into columns (which have names) and rows (which may or may not have names).
There are many ways to load data into R and they all depend upon what
format the data is in. R can handle data from .csv, .xlsx, .txt, .html,
.json, SPSS, Stata, SAS, among others. R also has it’s own data format
(.RDA, .Rdata). With the exception of .RDA, .csv is often the cleanest
means of reading in data. We won’t cover the other formats, but they are
fairly exhaustively covered
In the most basic sense, we can load our palmer-penguins.csv data file using the read.csv() function like this:
The read.csv() command loads in the data. If done
correctly, we should see our R Environment populate
with a dataframe labeled penguins.
penguins <- read.csv(file = "palmer-penguins.csv") #Load in the palmer-penguins CSV file and store it in a data frame called "penguins"
Since we’re all using the same dataset, the number of observations and variables should be the same as in the picture above. Here, you can think of observations as “rows” and variables as “columns”. If you click on penguins in the environment, it will open in a new tab of your Source Window (The same window you are likely writing script in) where you can view it.
We can also view the penguins data frame by using the View() command.
If we want to look at the first few rows of the penguins data frame, we can use the head() command.
View(penguins) #will open up the full data frame like you would in Excel
head(penguins) #will show you a subset of rows within the Data Frame
Amazing! Now that we can see the penguins dataset, let’s get a better idea of what each column in the dataset represents:
sample_number – Represents the continuous numbering sequence for each sample where data was collected
species – The three types of penguin species
island – The three types of island
bill_length_mm – A continuous number denoting bill length in millimeters
bill_depth_mm – An integer denoting bill depth in millimeters
flipper_length_mm – An integer denoting flipper length in millimeters
body_mass_g – A continuous number denoting body mass in grams
sex – Sex of the penguin
year – Year when the study took place
It’s super important to recognize the different data types that exist in R. Some of the common data types include: character, factor, double/numeric, integer.
You may have noticed that for some variables, like bill_length_mm, the column description says “continuous number”, whereas for other variables, like flipper_length_mm, the column variable says “integer”. This is because R uses different data types, which we’ll talk about more below.
Let’s use the glimpse() command from the tidyverse
library to print out the data type for each column in the penguins data
frame.
glimpse(penguins)
## Rows: 344
## Columns: 9
## $ sample_number <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1…
## $ species <chr> "Adelie", "Adelie", "Adelie", "Adelie", "Adelie", "A…
## $ island <chr> "Torgersen", "Torgersen", "Torgersen", "Torgersen", …
## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex <chr> "male", "female", "female", NA, "female", "male", "f…
## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
Here’s a quick glossary for the common data types that appear in R:
dbl references any number that reflects a continuous
number and can include decimals.
int reflects
integers, which only include whole numbers and do not include decimals.
chr reflects character, which includes
string-based text.
factor reflects factor, which
includes texts or numbers, but is typically used to reflect a category
(Think “Low”, “Medium”, “High” for texts or “1”, “2”, “3”, for
rankings).
Now that we’ve got a better understanding of what data can look like in R, let’s start working with the data.
By looking at the penguins data frame, we can see that we aren’t working with a perfectly clean dataset: some of the rows have missing data! And we may not need all of the columns in the dataframe.
So how do we work with specific rows? How do we work with specific columns? And how can we check what data is missing? Learning how to access specific elements of a data frame is an extremely important part of learning R.
dataframe$column will print out all the rows in that specific column. Let’s print out all the rows in the species column.
penguins$species
## [1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [7] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [13] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [19] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [25] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [31] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [37] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [43] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [49] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [55] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [61] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [67] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [73] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [79] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [85] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [91] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [97] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [103] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [109] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [115] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [121] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [127] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [133] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [139] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [145] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [151] "Adelie" "Adelie" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [157] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [163] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [169] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [175] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [181] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [187] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [193] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [199] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [205] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [211] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [217] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [223] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [229] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [235] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [241] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [247] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [253] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [259] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [265] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [271] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [277] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [283] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [289] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [295] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [301] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [307] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [313] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [319] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [325] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [331] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [337] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [343] "Chinstrap" "Chinstrap"
What if we want to see a specific row? Let’s say row 2 within the species column? To reference a specific row in a given column, we can add brackets and the number of that row in the brackets.
The code below will print out the second row in the species column. We can manually confirm this by opening up the penguins data frame and looking at the second row in the species column.
penguins$species[2]
## [1] "Adelie"
However, we can also index the column using it’s relative position. Knowing that the species column is the first column, I can use bracket notation. Bracket notation is super helpful once you understand its structure. It helps me to think of it as [rows, columns]. Any number that appears before the comma will access rows, and any number that appears after the comma will access columns.
By including the name of the data frame before the bracket notation, we can pull certain rows and columns from that data frame
penguins[1,] # print the first row across all 8 columns
penguins[,2] # print all the rows for column 2
## [1] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [7] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [13] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [19] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [25] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [31] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [37] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [43] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [49] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [55] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [61] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [67] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [73] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [79] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [85] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [91] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [97] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [103] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [109] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [115] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [121] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [127] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [133] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [139] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [145] "Adelie" "Adelie" "Adelie" "Adelie" "Adelie" "Adelie"
## [151] "Adelie" "Adelie" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [157] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [163] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [169] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [175] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [181] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [187] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [193] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [199] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [205] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [211] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [217] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [223] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [229] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [235] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [241] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [247] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [253] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [259] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [265] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [271] "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo" "Gentoo"
## [277] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [283] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [289] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [295] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [301] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [307] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [313] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [319] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [325] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [331] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [337] "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap" "Chinstrap"
## [343] "Chinstrap" "Chinstrap"
penguins[1,2] # print the first row in column 2
## [1] "Adelie"
Now that we know how to access rows and columns, let’s talk about subsetting!
Subsetting is a technique for filtering rows or columns in a given data frame in order to remove any rows/columns you are not interested in using.
We can subset rows and columns, but let’s start off with focusing on how we can subset specific rows.
We know that there are three species of penguins in this dataset: Adelie, Gentoo, and Chinstrap.
Let’s say we only wanted to focus on penguins from the Adelie species.
Thankfully, the filter() function from the tidyverse
library makes subsetting a bit easier. Here, we will use the
filter() function to subset the rows that represent
penguins from the Adelie species and store these rows in a new data
frame called “penguins_adelie”.
penguins_adelie <- penguins %>%
filter(species == "Adelie")
If you open up the penguins_adelie dataframe using
View(penguins_adelie) or manually clicking on the new data
object in the R Environment window, you should see that all the rows in
the species column of this data frame reflect the
Adelie species.
Let’s break down 2 operators here that are particularly important.
%>% refers to piping, which lets you pass the result of one function directly as the input to the next function. Here, all we are doing is feeding in the penguins data frame into the filter() function.
Notice the two equals signs (==) in this part of the code:
filter(species == "Adelie")
When two value operators (=, >, <, !) are placed next to each other in R, and many other languages, we aren’t assigning a value to an object; we are comparing the values between two different objects. In this instance, using two equals signs, if the two values are equal, it would produce a TRUE value; if not, then a FALSE. This variable which can only take the value of either TRUE or FALSE is called a boolean. When we tell R to compare the value on the right with this specific column, what it is mechanically doing is iterating through each row within this column, comparing the column value, and determining whether the condition is TRUE or FALSE.
We can use this same strategy, piping (%>%) and value operators (==), to subset columns as well.
Let’s say we were only interested in examining bill length, and not as interested in examining bill depth, flipper length, body mass, sex, and year.
We can use the select() function from the tidyverse library to keep the sample_number, species, island, and bill_length_mm columns, while also removing the columns we are not interested in (bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year).
Here, we will use the select() function to assign the
columns of interest (sample_number, species, island, bill_length_mm)
into a new data frame called “penguins_bill_length”.
penguins_bill_length <- penguins %>%
select(sample_number, species, island, bill_length_mm)
If you open up the penguins_bill_length dataframe using
View(penguins_bill_length) or manually clicking on
penguins_bill_length in the R Environment window, you should see that
only the columns we wanted to keep (sample_number, species, island,
bill_length_mm).
What if, rather than subsetting based on one condition (i.e., subset all rows that represent penguins from the Adelie species), we wanted to subset based on multiple conditions?
We can take advantage of the OR ( | ) operator using the filter() function.
Let’s say we were only interested in examining Adelie or Gentoo penguins, but not Chinstrap penguins.
We can use the OR operator to tell R to subset all rows where species is equal to “Adelie” OR “Gentoo” and store it in a new data frame called “penguins_adelie_gentoo”.
penguins_adelie_gentoo <- penguins %>%
filter(species == "Adelie" | species == "Gentoo")
If you open up the penguins_adelie_gentoo data frame, you can see that all the rows in the species column represent Adelie or Gentoo penguins, but not Chinstrap penguins
We can also take advantage of the AND ( & ) operator using the filter() function.
Let’s say we were only interested in examining penguins from the Adelie species AND penguins that lived on Biscoe island.
We can use the AND operator to tell R to subset all rows in the species column that are equal to “Adelie” AND all rows in the island column that are equal to “Biscoe” and store these rows in a new data frame called “penguins_adelie_biscoe”.
penguins_adelie_biscoe <- penguins %>%
filter(species == "Adelie" & island == "Biscoe")
If you open up the penguins_adelie_biscoe data frame, you can see that all the rows in the species column represent Adelie and all the rows in the island column represent Biscoe.
As you can see, leveraging the OR ( | ) and AND (
& ) operators in conjunction with the
`filter() function can be especially powerful.
Let’s try an exercise where we have to subset data!
1) Create a new data frame called “penguin_chinstrap” and subset all rows in the species column that represent penguins from the Chinstrap species.
'
'
What if we wanted to see which rows had missing values (e.g., NA) or not? In other words, what if some penguins were missing information about their bill length?
Let’s go back to our original “penguins” data frame.
We can use the is.na() function to determine which rows
have missing values in the bill_length_mm column.
is.na(penguins$bill_length_mm)
## [1] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [181] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [193] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [205] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [217] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [229] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [241] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [253] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE
## [277] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [289] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [301] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [313] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [325] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [337] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
This will produce an array of TRUEs and FALSEs of the same length as the rows in the dataframe, because each TRUE and FALSE is telling us whether each row value in that column meets the condition we defined.
For example, we see that the first output is FALSE, which means that the first row in the bill_length_mm column does NOT have a missing value. We can confirm this by opening up the penguins data frame and going to the first row in the bill_length_mm column and seeing that it is not empty.
However, we see that the 4th output is TRUE, which means that the fourth row in the bill_length_mm column DOES have a missing value. We can confirm this by opening up the penguins data frame and going to the fourth row in the bill_length_mm column and seeing that it is indeed missing.
But how can we remove missing data (i.e., rows that are blank or have an ‘NA’ in it) from a data frame?
We can use the drop_na() function from the tidyverse library to create a new data frame called “penguins_complete” that only includes rows with no missing values in any columns.
penguins_complete <- penguins %>% drop_na()
What if, instead of removing rows that have a missing value in ANY column, we wanted to remove any rows that have a missing value in ONE specific column? We can again use the drop_na() function and specify the specific column that we want to remove rows with missing values from.
Let’s create a new data framed called “penguins_bill_length_complete” that only includes rows with no missing values in the bill_length_mm column.
penguins_bill_length_complete <- penguins %>% drop_na(bill_length_mm)
As you can see, the penguins_bill_length_complete data frame only includes rows that do not have a missing value in the bill_length_mm column, but still have missing values in other columns (e.g., sex)
An if-else statement is a powerful control flow tool that allows a computer to make decisions and execute different code paths based on specific conditions.
It works kind of like a fork in the road: if a certain condition is true, the program follows one path of code; if the condition is false, it follows an alternative code path.
It can be helpful to think of it as a digital version of decision-making -— “If it’s raining, I’ll take an umbrella; else, I’ll leave without one.
Here’s an example of what a general if-else expression looks like in R. This is similar to the fork/umbrella analogies that we discussed earlier.
if (condition) {
expression
} else {
expression
}
We can map this structure onto R’s ifelse() function,
which makes writing if-else statements a bit easier.
Let’s use an if-else statement to create a new column that represents whether an island existed in the Eastern or Western hemisphere of the world.
Let’s say that the Biscoe island existed in the Eastern hemisphere, and the Dream and Torgersen islands existed in the Western hemisphere.
The structure for ifelse() statements is as follows: if the value in the island column has a cell that is equal to “Biscoe”, R will insert a value of “Eastern” in the new hemisphere column for that corresponding cell, else, insert a value of “Western”.
We can use the mutate() function from the tidyverse to
add the new “hemisphere” column to our penguins data frame.
# Use mutate() with ifelse() to add a new column called "hemisphere" that reflects whether the island was located on the Eastern or Western hemisphere
penguins <- penguins %>%
mutate(hemisphere = ifelse(penguins$island == "Biscoe", "Eastern", "Western"))
Here’s what our code would look like within the general if-else expression that we showed above.
#General structure of if statement
if (penguins$island == "Biscoe") {
penguins$hemisphere <- "Eastern"
} else {
penguins$hemisphere <- "Western"
}
We can also use a for-loop to accomplish the same outcome that we had
when using the ifelse() function.
A for-loop is one of the main control-flow constructs of the R programming language. It is used to iterate over a collection of objects, such as a vector, a list, a matrix, or a dataframe, and apply the same set of operations on each item of a given data structure.
For illustrative purposes, rather than overwriting the “hemisphere” column, let’s make a new column called “hemisphere_loop”
for (i in 1:length(penguins$island)) {
if (penguins$island[i] == "Biscoe") {
penguins$hemisphere_loop[i] <- "Eastern"
} else {
penguins$hemisphere_loop[i] <- "Western"
}
}
Lets break this for-loop code down in some more detail.
1) for (i in 1:length(penguins$island)) { — “i” is a temporary variable that store the values of the current position in the range of the for loop. In this case, we are telling R that we want “i” to represent each row within the length of the island column, starting at row 1 and going all the way down until the last row in the island column. “i” will iterate across each of these rows.
2) if (penguins$island[i] == “Biscoe”) { — This if statement is saying: if the value of i in the specific iteration of the island row is equal to “Biscoe”. Think about it as “if the first cell in the island row is equal to”Biscoe”, then if the second cell in the island row is equal to “Biscoe”, then if the third cell in the island row is equal to “Biscoe”, and so on..
3) penguins$hemisphere_loop[i] <- “Eastern” — input a value of “Eastern” into the corresponding cell of the new “hemisphere_loop” column.
4) } else { — the initiation of the else condition
5) penguins$hemisphere_loop[i] <- “Western” – input a value of “Western” into the corresponding cell of the new “hemisphere_loop” column
Merging data frames allows us to combine different datasets based on common variables. In other words, merging data together allows us to combine information from multiple sources into a single, comprehensive dataset.
To illustrate why merging is especially powerful, let’s start by reading in the palmer-penguins_raw CSV file, which contains additional information about the penguins.
penguins_raw <- read.csv(file = "palmer-penguins_raw.csv") #Load in the palmer-penguins_raw CSV file and store it in a data frame called "penguins_raw"
In the original penguins dataset that we’ve been working with, we can examine bill length (bill_length_mm) and flipper length (flipper_length_mm), but let’s say we were really interested in examining culmen (the ridge along the top part of a penguin’s bill) length. As we can see, the original penguins dataset does not have this information, but the palmer-penguins raw dataset does!
We can combine these two data frames into one data frame, so that we have all the information in one place, and so that we can examine culmen length.
Let’s use the merge() function to merge the original
penguins data frame and the penguins_raw data frame into a new data
frame called “penguins_merged”.
Here’s the general structure of the merge() function.
1) x and y: The two data frames you want to merge.
2) by: The column(s) to merge on when the column name is the same in both data frames.
penguins_merged <- merge(penguins, penguins_raw, by=c("sample_number", "species", "island", "flipper_length_mm", "body_mass_g", "sex"))
In our code, x and y represent the data frames we want to merge. All of the columns listed in the by=c expression represent the columns that are shared across both data frames (i.e., species, bill_length_mm, flipper_length_mm, body_mass_g, sex), so that R knows which columns we want to match across.
In this way, R will then be able to match the columns that are NOT shared across the two data frames (i.e., bill_length_mm, bill_depth_mm, year, stage, clutch_competition, date_eg, culmen_length_mm, culmen_depth_mm, delta_15_n, delta_13_c, comments).
Now we have all the information we need in one place!
Next we will learn how to pivot data from wide to long, and long to wide.
Before we get started with pivoting, let’s subset a smaller data frame just to make it easier to work with.
Let’s say we only wanted to focus on the length of different anatomical regions (i.e., bill, flipper, culmen). We can subset a data frame with the following columns: sample_number, species, island, bill_length_mm, flipper_length_mm, culmen_length_mm, and sex and store these columns in a new data frame called “penguins_length”.
penguins_length <- penguins_merged %>%
select(sample_number, species, island, bill_length_mm, flipper_length_mm, culmen_length_mm, sex)
Perfect! Next let’s break down what it means to pivot data.
When data exists in a “wide” format, that means that each individual only has 1 row.
When data exists in a “long” format, that means that each individual has more than 1 row (i.e., repeated-measures).
For the purposes of understanding how to pivot data frames, we will be using the penguins_length data frame that we just created.
In order to do many common statistical analyses, data needs to be in long format. For example, let’s say we wanted to show how bill, flipper, and culmen length differ across penguin species.
If we wanted to make a graph to visualize this comparison (bill length vs flipper length vs culmen length), we could use the species column as our x-variable, but what would our Y variable be? There are 3 columns (1 for bill_length_mm, 1 for flipper_length_mm, and 1 for culmen_length_mm), but there is only 1 Y-variable. So as the data currently stands, we could not create a graph that shows these comparisons! We must pivot.
The pivot_longer() function from the tidyr package in R
can be used to pivot a data frame from a wide format to a long
format.
penguins_long <- penguins_length %>% pivot_longer(
cols=c("bill_length_mm", "flipper_length_mm", "culmen_length_mm"), #The names of the columns to pivot
names_to = "measure", #The name for the new character column
values_to = "length") #The name for the new values column
Here, we did a basic pivot longer to convert the data from wide to long, where the new column that we created (measure) reflects the column names that we included in the cols=c argument of the pivot_longer function. The other new column that we created, length, reflects the values that existed in the columns that we included in the cols=c argument of the pivot_longer function.
This data frame is in long format because each penguin has 3 rows (1 for bill length, 1 for flipper length, and 1 for culmen length)
Pivoting data can be a bit tricky to understand at first, but hopefully this example helps!
Data can also be converted from long to wide!
The pivot_wider() function from the tidyr package in R
can be used to pivot a data frame from long format to wide format.
Let’s work on pivoting the long data frame that we created back to its original wide format!
#Pivot wider
penguins_wide <- penguins_long %>%
pivot_wider(names_from = measure, #names_from: The column whose values will be used as column names
values_from = length) #values_from: The column whose values will be used as cell values
Here, we converted the data frame from long back to its original wide format! We know it’s in wide format because each penguin now only has 1 row.
By including the “measure” column in the “names_from” column, we are telling R that this is the column that will be used to generate column names. By including the “length” column in the “values_from” column, we are telling R that this is the column whose values will be used to generate row values.
Let’s try an exercise where we pivot data from wide to long!
1) Subset the following columns from the penguin_length data frame and store them in a new data frame called “penguin_pivot_exercise”: sample_number, species, bill_length_mm, bill_depth_mm
2) Pivot the data frame from wide to long and store it in a a new data frame called “penguin_pivot_exercise_long”
3) You should end up with 4 columns: sample_number, species, bill_measure, length
'
'
Next, we are going to start talking about different types of statistical analyses in R!
We’re going to focus on three common statistical analyses: t-tests, ANOVAs, and linear regressions.
We’ll continue our focus on comparing the length of different anatomical parts (bill, flipper, and culmen) across penguin species in our statistical analyses below!
For our analyses, let’s use the penguins_length data frame that we created earlier which includes the following columns: sample_number, species, island, bill_length_mm, flipper_length_mm, culmen_length_mm.
To make things easier for our analyses, let’s remove any rows with
missing values in the data using the drop_na()
function.
penguins_length <- penguins_length %>% drop_na()
Now our data is ready to be analyzed!
A T-Test can be used when both the predictor variable consists of two categorical options and the outcome or dependent variable is numeric in value. A T-Test tells you how significant the differences between these categories or groups are. In other words, it lets you know if the differences between the means of two groups could have observed by chance. We could imagine a situation where an evil teacher told half of the class before a test the right chapter to study from and told the other half of the class the wrong chapter to study from. The two categories or groups might be Right Chapter and Wrong Chapter and the outcome variable would be Test Score. Using a T-Test, we could determine whether studying from the right chapter produces higher test scores.
Below, we propose a research question and t-test analysis for our penguins dataset.
QUESTION: Do Adelie and Gentoo penguins differ in flipper length?
HYPOTHESIS: On average, Adelie and Gentoo penguins differ in flipper length.
RELEVANT VARIABLES: Dependent variable: flipper_length_mm (numeric), Independent variable: species (Factor)
ANALYSIS: Unpaired Two-Samples T-Test
Now let’s do the t-test!
We’ll need to use conditional statements again to specify our variables. What we are comparing here are the mean values of flipper length for Adelie and Gentoo penguins. As such, we are going to specify we want flipper length when species == “Adelie” and when species == “Gentoo. We next have an argument which asks us whether this analysis is within-subjects or between-subjects. This question is between-subjects, since each penguin is either from the Adelie species or the Gentoo species, so we’ll mark that as FALSE. Lastly, R is asking us to define our alternative hypothesis, which is a little beyond the scope of this review, so you will have to take my word that “two.sided” is the right call.
model1 <- t.test(x = penguins_length$flipper_length_mm[penguins_length$species == "Adelie"],
y = penguins_length$flipper_length_mm[penguins_length$species == "Gentoo"],
paired = FALSE,
alternative = "two.sided")
#Print the model results
model1
##
## Welch Two Sample t-test
##
## data: penguins_length$flipper_length_mm[penguins_length$species == "Adelie"] and penguins_length$flipper_length_mm[penguins_length$species == "Gentoo"]
## t = -33.506, df = 251.35, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -28.72740 -25.53771
## sample estimates:
## mean of x mean of y
## 190.1027 217.2353
We can observe the results of our t-test by printing the
model1 data object that we created. However, R has the
report() package which makes interpreting results a little
easier!
Let’s use the report() function from the report library
to help with our interpretation of the t-test results.
report(model1)
## In the report() function, for htest objects, you can try providing the
## data argument manually, e.g., report(x, data = data).
## Warning: Unable to retrieve data from htest object.
## Returning an approximate effect size using t_to_d().
## Effect sizes were labelled following Cohen's (1988) recommendations.
##
## The Welch Two Sample t-test testing the difference between
## penguins_length$flipper_length_mm[penguins_length$species == "Adelie"] and
## penguins_length$flipper_length_mm[penguins_length$species == "Gentoo"] (mean of
## x = 190.10, mean of y = 217.24) suggests that the effect is negative,
## statistically significant, and large (difference = -27.13, 95% CI [-28.73,
## -25.54], t(251.35) = -33.51, p < .001; Cohen's d = -4.23, 95% CI [-4.67,
## -3.78])
So it looks like our hypothesis panned out! We see statistically significant differences, judging by the model output (t(251.35) = -33.51, p-value < .001). Since penguins from the Adelie species were our reference group, the negative value of the T-statistic indicates that the first group (Adelie) has a lower mean than the second group (Gentoo). In other words, on average, penguins from the Gentoo species have a longer flipper length than penguins from the Adelie species.
Next we’ll talk about running an ANOVA test!
An ANOVA, or Analysis of Variance, can be used when both the predictor variable or variables consist of two or more categorical options and the outcome or dependent variable is numeric in value. Much like a T-Test, ANOVA tells you how significant the differences between these categories or groups are. The advantage over T-tests is that we can compare multiple groups or categories in one analysis. We could revisit our last horrible example and imagine that the evil teacher tells one group right chapter to study from, one group the wrong chapter to study from, and one group to not study at all. An ANOVA test will tell us whether any of these three groups are different from one another (but not necessarily which specific groups are different from one another).
Using the penguins_merged dataframe, we’ll run a similar analysis that we did for the t-test, but this time, rather than only using two penguin species (Adelie and Gentoo), we’ll include all three penguin species (Adelie, Gentoo, and Chinstrap).
QUESTION: Does bill length differ across penguin species (i.e., Adelie, Gentoo, Chinstrap)?
HYPOTHESIS: There are differences in bill length across penguin species.
RELEVANT VARIABLES: Dependent variable: bill_length_mm (numeric) Independent variable: species (Factor)
ANALYSIS: ANOVA
Pay close attention to the formatting of the syntax here. It is the standard way in which we specify most statistical models in R, whether for regression, ANOVA, hierarchical modeling etc.
model2 <- aov(bill_length_mm ~ species, data = penguins_length) #create ANOVA model and store in an object called model2
#print summary of model results
summary(model2)
## Df Sum Sq Mean Sq F value Pr(>F)
## species 2 7015 3508 397.3 <2e-16 ***
## Residuals 330 2914 9
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The report() function works equally well on ANOVA
models, too.
report(model2)
## The ANOVA (formula: bill_length_mm ~ species) suggests that:
##
## - The main effect of species is statistically significant and large (F(2, 330)
## = 397.30, p < .001; Eta2 = 0.71, 95% CI [0.67, 1.00])
##
## Effect sizes were labelled following Field's (2013) recommendations.
It seems that the main effect of species has a significant effect on the length of bills, judging by the model output (F(2, 330) = 397.30, p-value < .001)
Next, we’ll switch to linear regressions!
A bivariate linear regression can be used when both the predictor variable (X) and outcome variable (Y) consist of continuous numeric values. A linear regression tells us how predictive of Y that X is. In other words, if we measured temperature and ice cream sales, we might find, using linear regression that as temperature increases, we could predict with decent accuracy that ice cream sales would increase as well, and we could predict how many ice cream sales we expect to see for any one value of temperature.
Within the context of the penguins dataset, we could ask a question like whether the length of penguins’ bills predicts the length of their flippers.
QUESTION: Does bill length predict flipper length?
HYPOTHESIS: As bill length increases, flipper length also increases.
RELEVANT VARIABLES: Dependent variable: Flipper length (numeric) Independent variable: Bill length (numeric)
ANALYSIS: Bivariate Linear Regression
Okay now it’s time to run our regression!
You’re going to notice right off the bat that the structure of the syntax here looks awfully similar to what we just did in ANOVA. We start by noting our method with the lm() function (Linear Modeling). We then note our outcome variable, add a tilde (~), note our predictor(s), and finally note our datasource.
#Store bivariate linear regression in an object called "m1"
m1 <- lm(flipper_length_mm ~ bill_length_mm, data = penguins_length)
Just like ANOVA, we need to use the summary() function to read the data.
#Use summary() function to print summary for m1 bivariate linear model
summary(m1)
##
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm, data = penguins_length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.413 -7.837 0.652 8.360 21.321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 127.3304 4.7291 26.93 <2e-16 ***
## bill_length_mm 1.6738 0.1067 15.69 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.63 on 331 degrees of freedom
## Multiple R-squared: 0.4265, Adjusted R-squared: 0.4248
## F-statistic: 246.2 on 1 and 331 DF, p-value: < 2.2e-16
Lastly, we can run the report() function for linear regression models too!
#Use report() function to print report for m1 bivariate linear model
report(m1)
## We fitted a linear model (estimated using OLS) to predict flipper_length_mm
## with bill_length_mm (formula: flipper_length_mm ~ bill_length_mm). The model
## explains a statistically significant and substantial proportion of variance (R2
## = 0.43, F(1, 331) = 246.19, p < .001, adj. R2 = 0.42). The model's intercept,
## corresponding to bill_length_mm = 0, is at 127.33 (95% CI [118.03, 136.63],
## t(331) = 26.92, p < .001). Within this model:
##
## - The effect of bill length mm is statistically significant and positive (beta
## = 1.67, 95% CI [1.46, 1.88], t(331) = 15.69, p < .001; Std. beta = 0.65, 95% CI
## [0.57, 0.73])
##
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.
Linear regression results show that bill length statistically predicts flipper length, as observed by model output stats: (beta = 1.67, 95% CI [0.57, 0.73], t(331) = 15.69, p < .001).
Whereas bivariate linear regression uses 1 independent predictor and 1 dependent variable (hence, “bi”), a multivariate linear regression builds upon bivariate regression by allowing for more than 1 predictors (hence “multi”). If we measured temperature, ice cream sales, and the time since someone last ate, we might find that our previously specified model is now even more accurate because of the addition of the “time since last ate” variable.
Using the penguins data, let’s examine if bill length still predicts flipper length when controlling for differences in sex.
QUESTION: Does bill length predict flipper length when accounting for the effect of sex?
HYPOTHESIS: Bill length still predicts flipper length when controlling for sex.
RELEVANT VARIABLES: Dependent variable: flipper length (numeric) Independent variables: Bill length (numeric) and sex (factor)
ANALYSIS: Multivariate Linear Regression
#Store multivariate linear regression in an object called "m2
m2 <- lm(flipper_length_mm ~ bill_length_mm + sex, data = penguins_length)
Just like the bivariate linear regression, we need to use the summary() function to read the data.
#Use summary() function to print summary for m2 multivariate linear model
summary(m2)
##
## Call:
## lm(formula = flipper_length_mm ~ bill_length_mm + sex, data = penguins_length)
##
## Residuals:
## Min 1Q Median 3Q Max
## -42.498 -8.227 0.589 8.401 20.983
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 128.1827 4.8568 26.392 <2e-16 ***
## bill_length_mm 1.6434 0.1137 14.456 <2e-16 ***
## sexmale 0.9669 1.2416 0.779 0.437
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.64 on 330 degrees of freedom
## Multiple R-squared: 0.4276, Adjusted R-squared: 0.4241
## F-statistic: 123.3 on 2 and 330 DF, p-value: < 2.2e-16
Lastly, we can run the report() function for linear
models as well!
#Use report() function to print report for m2 multivariate linear model
report(m2)
## We fitted a linear model (estimated using OLS) to predict flipper_length_mm
## with bill_length_mm and sex (formula: flipper_length_mm ~ bill_length_mm +
## sex). The model explains a statistically significant and substantial proportion
## of variance (R2 = 0.43, F(2, 330) = 123.25, p < .001, adj. R2 = 0.42). The
## model's intercept, corresponding to bill_length_mm = 0 and sex = female, is at
## 128.18 (95% CI [118.63, 137.74], t(330) = 26.39, p < .001). Within this model:
##
## - The effect of bill length mm is statistically significant and positive (beta
## = 1.64, 95% CI [1.42, 1.87], t(330) = 14.46, p < .001; Std. beta = 0.64, 95% CI
## [0.55, 0.73])
## - The effect of sex [male] is statistically non-significant and positive (beta
## = 0.97, 95% CI [-1.48, 3.41], t(330) = 0.78, p = 0.437; Std. beta = 0.07, 95%
## CI [-0.11, 0.24])
##
## Standardized parameters were obtained by fitting the model on a standardized
## version of the dataset. 95% Confidence Intervals (CIs) and p-values were
## computed using a Wald t-distribution approximation.
Linear regression results show that bill length still statistically predicts flipper length when accounting for differences in sex, as observed by model output stats for the effect of bill length: (beta = 1.64, 95% CI [1.42, 1.87], t(330) = 14.46, p < .001).
Hopefully these analyses have made sense! Our hope is that you can use this analysis code to conduct your own analyses in your own research!
Next we’ll start talking about how to make graphs in R!
ggplot2 is an extremely popular plotting package in R that makes it fairly easy to create complex plots from data in a data frame.
ggplot2 refers to the name of the package itself, whereas we use the
ggplot() function to generate the plots. We’re going to
start off with building a very simple plot, and then we will add in some
more lines to organize a plot like you would for a
manuscript/publication!
We can start off by generating a histogram to visualize our dataset
and see the distribution of flipper length and color code it by penguin
species. The geom_histogram() argument is what allows us to
make the histogram!
ggplot(penguins_length, aes(x=flipper_length_mm, fill=species)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Histograms are a very helpful way to visualize data and better understand the distribution of specific variables of interest, and can also be helpful for visually identifying outliers.
Next, we’ll move onto creating bar plots and adding standard error bars to graphs!
Let’s plot the ANOVA we ran earlier, where we examined whether bill length differed across the three penguin species.
In order to add standard error bars to a plot, we first need to run
the summarySE() function that will calculate the standard
error for our dependent variable (bill length).
As a quick reminder, standard error bars reflect the uncertainty or variability in the mean value that’s being displayed. In other words, they indicate how precisely we know the true population mean - a smaller standard error suggests we can be more confident in our estimate, while larger error bars indicate more uncertainty. The standard error is calculated by dividing the standard deviation by the square root of the sample size.
The summarySE() function is appropriate when working
with between-subjects variables. If you have within-subjects variables
and want to adjust the error bars so that inter-subject variability is
removed as suggested in Loftus and Masson (1994), then the other two
functions, normDataWithin() and summarySEwithin() must also be added to
your code; summarySEwithin would then be the function that you call.
You don’t have to change anything in this code – just run it as it
is! All this code is doing is authorizing the use of the
summarySE() function.
## Gives count, mean, standard deviation, standard error of the mean, and confidence interval (default 95%).
## data: a data frame.
## measurevar: the name of a column that contains the variable to be summariezed
## groupvars: a vector containing names of columns that contain grouping variables
## na.rm: a boolean that indicates whether to ignore NA's
## conf.interval: the percent range of the confidence interval (default is 95%)
summarySE <- function(data=NULL, measurevar, groupvars=NULL, na.rm=FALSE,
conf.interval=.95, .drop=TRUE) {
library(plyr)
# New version of length which can handle NA's: if na.rm==T, don't count them
length2 <- function (x, na.rm=FALSE) {
if (na.rm) sum(!is.na(x))
else length(x)
}
# This does the summary. For each group's data frame, return a vector with
# N, mean, and sd
datac <- ddply(data, groupvars, .drop=.drop,
.fun = function(xx, col) {
c(N = length2(xx[[col]], na.rm=na.rm),
mean = mean (xx[[col]], na.rm=na.rm),
sd = sd (xx[[col]], na.rm=na.rm)
)
},
measurevar
)
# Rename the "mean" column
datac <- plyr::rename(datac, c("mean" = measurevar))
datac$se <- datac$sd / sqrt(datac$N) # Calculate standard error of the mean
# Confidence interval multiplier for standard error
# Calculate t-statistic for confidence interval:
# e.g., if conf.interval is .95, use .975 (above/below), and use df=N-1
ciMult <- qt(conf.interval/2 + .5, datac$N-1)
datac$ci <- datac$se * ciMult
return(datac)
}
Now that we have the summarySE() function loaded in, we
can calculate the standard error of bill length for each penguin
species.
Below, we will use the summarySE() function where the
measurevar argument reflects our dependent variable
(bill_length_mm) and the groupvars argument reflects
our independent variable (species).
# summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
penguins_length.se <- summarySE(penguins_length, measurevar="bill_length_mm", groupvars=c("species"))
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
#We can print the new data frame that we created to see what information is stored there
print(penguins_length.se)
## species N bill_length_mm sd se ci
## 1 Adelie 146 38.82397 2.662597 0.2203581 0.4355288
## 2 Chinstrap 68 48.83382 3.339256 0.4049443 0.8082721
## 3 Gentoo 119 47.56807 3.106116 0.2847372 0.5638571
Great! We can see that the penguins_length.se data frame includes the mean, standard deviation, standard error, and confidence intervals for our dependent variable of interest (bill length) across the three penguin species.
Now that we have the standard errors calculated, let’s make a bar plot with standard errors to compare bill length across the three penguin species.
ggplot(penguins_length.se, aes(x = species, y = bill_length_mm, fill = species)) + #Plot the variables using the penguins_length.se data frame
geom_bar(stat="identity") + #Use the geom_bar() function to generate a bar plot
geom_errorbar(aes(ymin=bill_length_mm-se, ymax=bill_length_mm+se), #Use the geom_errorbar() function to plot the standard error bars
width=.2, # Define the width of the error bars
position=position_dodge(.9))
As a visual confirmation of our ANOVA results, we can see that there are indeed differences in bill length across the penguin species – most notably with the bill length of the Gentoo and Chinstrap penguins differing from the bill length of Adelie penguins.
Next, we’ll visualize the bivariate linear regression that we ran above, where we examined whether bill length predicted flipper length in penguins.
We can use a scatterplot to visualize this relationship and add a linear regression line to the plot.
ggplot(penguins_length, aes(x=bill_length_mm, y=flipper_length_mm)) + #Plot the variables we are interested in
geom_point() + #geom_point() is the function used to create scatter plots
geom_smooth(method=lm, color="black") #Adds a linear regression line to your plot
## `geom_smooth()` using formula = 'y ~ x'
Amazing! Now we have a visual representation of what our bivariate linear regression results look like. It’s clear that as bill length increases, so does flipper length.
Hopefully this part has been helpful for learning how to visualize data!
So far, we’ve covered quite a lot in separate sections! To end the R workshop, we wanted to put together a complete pipeline that you can use to analyze your own data in the future, starting with loading R packages, reading data into R, cleaning the data up, running a t-test, and then visualizing the results.
Let’s say our goal was to examine whether bill length significantly differs from flipper length in penguins.
Here’s what a complete pipeline could look like to address this question using the tidyverse library in R!
#Step 1: Load the R packages you will be working with (this assumes that you've previously installed the R packages).
library(tidyverse)
library(report)
library(lme4)
library(ggplot2)
#Step 2: Read the data into R
penguins <- read.csv(file = "palmer-penguins.csv") #Load in the penguins CSV file and store it in a data frame called "penguins"
#Step 3: Subset the columns you are interested in using the select() function
penguins_length_pipeline <- penguins %>%
select(sample_number, species, island, bill_length_mm, flipper_length_mm)
#Step 4: Pivot data frame from wide to long
penguins_pipeline_long <- penguins_length_pipeline %>% pivot_longer(
cols=c("bill_length_mm", "flipper_length_mm"), #The names of the columns to pivot
names_to = "anatomical_region", #The name for the new character column
values_to = "length") #The name for the new values column
#Step 5: Clean your data and remove any rows with missing values
penguins_pipeline_long <- penguins_pipeline_long %>% drop_na()
#Step 6: Run a t-test to determine whether bill length significantly differs from flipper length
model2 <- t.test(x = penguins_pipeline_long$length[penguins_pipeline_long$anatomical_region == "bill_length_mm"],
y = penguins_pipeline_long$length[penguins_pipeline_long$anatomical_region == "flipper_length_mm"],
paired = FALSE,
alternative = "two.sided")
#Step 7: Interpret results of t-test output
report(model2)
#Step 8: Calculate the standard errors before plotting
#summarySE provides the standard deviation, standard error of the mean, and a (default 95%) confidence interval
penguins_pipeline_long.se <- summarySE(penguins_pipeline_long, measurevar="length", groupvars=c("anatomical_region"))
#Step 9: Create bar plot with standard error bars to visualize t-test
ggplot(penguins_pipeline_long.se, aes(x = anatomical_region, y = length, fill = anatomical_region)) +
geom_bar(stat="identity") +
geom_errorbar(aes(ymin=length-se, ymax=length+se),
width=.2,
position=position_dodge(.9))
#Step 10: All done!!!
Congrats! You made it through an R workshop! Hopefully you feel like you’re starting to pick up on some things! It’s also totally okay if things aren’t clicking just yet, but regardless of how you may be feeling, I will always be happy to work with you if you have any specific questions as you move forward on your coding journey. My email is stevent.martinez@temple.edu or you can always send me a Slack!
[What people think coding is versus what it actually is]